
feat: Perform tolerance-based comparison for lists and arrays #19

Open
MariusMerkleQC wants to merge 7 commits into main from list_arr

MariusMerkleQC commented Mar 19, 2026

Motivation

Partially addresses #8. Sequences (lists and arrays) are compared element-wise, iterating over each position in the sequence. Each element is then compared using the standard type-aware logic, so absolute/relative float tolerances and absolute temporal tolerances all apply naturally.
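The element-wise idea can be modelled in plain Python. This is only a rough sketch of the semantics, not the PR's Polars implementation: the helper names `elements_close` and `sequences_close` are hypothetical, the tolerance formula is the one used by `math.isclose` (which may differ from the library's), and nulls and temporal types are left out.

```python
import math


def elements_close(lhs, rhs, abs_tol=1e-8, rel_tol=1e-5):
    """Type-aware scalar comparison: floats use tolerances, other types exact equality."""
    if isinstance(lhs, float) or isinstance(rhs, float):
        return math.isclose(lhs, rhs, rel_tol=rel_tol, abs_tol=abs_tol)
    return lhs == rhs


def sequences_close(lhs, rhs, abs_tol=1e-8, rel_tol=1e-5):
    """Compare two sequences position by position (inputs assumed non-null)."""
    if len(lhs) != len(rhs):
        return False
    return all(elements_close(a, b, abs_tol, rel_tol) for a, b in zip(lhs, rhs))


close = sequences_close([1.0, 2.0], [1.0, 2.0 + 1e-9])   # within tolerance
far = sequences_close([1.0, 2.0], [1.0, 2.1])            # outside tolerance
```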

Maximum sequence length

An array's length (or shape, for multi-dimensional arrays) is known statically from its data type. When at least one of the two compared columns is an array, its length determines the number of elements to compare. When both columns are lists, the maximum list length must be computed at runtime. This is handled by the cached property _max_list_lengths_by_column, a dictionary mapping column names to their maximum list length, populated only for columns that are pl.List in both data frames. The resolved max_list_length: int | None is then passed to condition_equal_columns().

Sequences of different lengths

Arrays have a fixed length, so comparing two arrays of different shapes can immediately return False. In all other cases, lengths may vary row by row, which is captured in the has_same_length expression. To avoid out-of-bounds errors when indexing into shorter lists, elements are extracted with null_on_oob=True, which yields null instead of raising. The final result combines has_same_length with elements_match (the element-wise comparison), so rows with mismatched lengths are marked as unequal.

Multi-dimensional sequences

Nested sequences (e.g., lists of lists or multi-dimensional arrays) are handled recursively: outer elements are extracted positionally, then compared via the same _compare_columns logic until primitive types are reached. When both sides are lists at an inner nesting level, no max_list_length is available, so the comparison falls back to direct equality without element-wise unrolling (i.e., tolerances do not apply at inner list levels).
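The recursion and its inner-list fallback can be modelled in plain Python. Everything here is hypothetical scaffolding (the `Seq` type descriptor, `known_length`, and `compare` are not names from this PR), and only an absolute float tolerance is modelled; the point is just to show where tolerances stop applying.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Seq:
    inner: object                # nested Seq or a primitive Python type
    width: Optional[int] = None  # fixed width for arrays, None for lists


def known_length(lhs_type, rhs_type):
    """Return a statically known length if either side is an array, else None."""
    for dtype in (lhs_type, rhs_type):
        if isinstance(dtype, Seq) and dtype.width is not None:
            return dtype.width
    return None


def compare(lhs, rhs, lhs_type, rhs_type, length, abs_tol=1e-6):
    if not isinstance(lhs_type, Seq):
        # Primitive level: floats are compared with a tolerance.
        if lhs_type is float:
            return math.isclose(lhs, rhs, rel_tol=0.0, abs_tol=abs_tol)
        return lhs == rhs
    if length is None:
        # List vs. list without a known length: direct equality, no tolerance.
        return lhs == rhs
    if len(lhs) != len(rhs):
        return False
    inner_length = known_length(lhs_type.inner, rhs_type.inner)
    return all(
        compare(a, b, lhs_type.inner, rhs_type.inner, inner_length, abs_tol)
        for a, b in zip(lhs, rhs)
    )


list_of_arrays = Seq(Seq(float, width=2))
list_of_lists = Seq(Seq(float))
lhs, rhs = [[1.0, 2.0]], [[1.0, 2.0 + 1e-9]]

with_inner_arrays = compare(lhs, rhs, list_of_arrays, list_of_arrays, length=1)
with_inner_lists = compare(lhs, rhs, list_of_lists, list_of_lists, length=1)
```

With inner arrays the tiny difference is absorbed by the tolerance; with inner lists the same data compares unequal because the inner level uses direct equality.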

Changes

  • add cached property _max_list_lengths_by_column: dict[str, int]
  • introduce function _compare_sequence_columns() to compare lists and arrays with each other
  • add extensive test coverage
    • modify test_condition_equal_columns_list_array_{equal_exact -> with_tolerance} to reflect the updated logic
    • build on the former test with nested sequence types in test_condition_equal_columns_nested_list_array_with_tolerance
    • test comparison of two list columns in test_condition_equal_columns_two_lists, including empty lists, lists containing None, and None itself
    • cover mismatches of lengths in test_condition_equal_columns_array_vs_list_length_mismatch
    • cover mismatching array shapes in test_condition_equal_columns_two_arrays_different_shapes
    • handle the edge case of empty arrays and lists in test_condition_equal_columns_empty_arrays and test_condition_equal_columns_empty_lists, respectively

@MariusMerkleQC MariusMerkleQC self-assigned this Mar 19, 2026
@github-actions github-actions bot added the enhancement New feature or request label Mar 19, 2026

codecov bot commented Mar 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (2ae4c11) to head (955517c).

Additional details and impacted files
@@            Coverage Diff            @@
##              main       #19   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           10        10           
  Lines          707       743   +36     
=========================================
+ Hits           707       743   +36     

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

@Quantco Quantco deleted a comment from Copilot AI Mar 20, 2026
@Quantco Quantco deleted a comment from Copilot AI Mar 20, 2026
@Quantco Quantco deleted a comment from Copilot AI Mar 20, 2026
Comment on lines +175 to +176
if isinstance(lhs_type, pl.List) and isinstance(rhs_type, pl.List):
    assert actual.to_list() == [True, False, False]
Collaborator Author

To fix this, I already had a solution that computes the maximum list length among all nesting levels in a "data type tree". For example, if you had a list of lists where the inner lists are longer than the outer lists, max_list_length would be the value of the inner list length. As this increases complexity even more, I'd like to first get this to main and implement it in a follow-up PR.
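As a rough illustration of the follow-up idea, the maximum length across all nesting levels can be computed on plain Python values like this. It is a sketch only: `max_nested_list_length` is a hypothetical name, and it walks values rather than a Polars data type tree.

```python
def max_nested_list_length(value):
    """Return the largest list length found at any nesting level of `value`."""
    if not isinstance(value, list):
        return 0
    return max([len(value)] + [max_nested_list_length(item) for item in value])


# Outer lists have lengths 2 and 1; the longest inner list has length 3.
rows = [[[1.0, 2.0, 3.0], [4.0]], [[5.0]]]
max_list_length = max(max_nested_list_length(row) for row in rows)
```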

@MariusMerkleQC MariusMerkleQC marked this pull request as ready for review March 20, 2026 09:30
Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Properly perform floating point comparisons for structs and lists

2 participants